NoSQL operator: muxtotable

Converts an unordered list of name/value pairs into a NoSQL table.

Usage: muxtotable [options] [table]

Options:
    --input (-i) 'file'
      Read input from 'file' instead of STDIN.

    --output (-o) 'file'
      Write output to 'file' instead of STDOUT.

    --help (-h)
      Display this help text.

    --no-header (-N)
      Remove header from output.

    --continuation (-c) 'string'
      A line of data beginning with 'string' is assumed to be the
      continuation of the previous line. If this option is omitted
      then the default continuation marker is the '@' character.

    --skip (-s) 'pattern'
      Skip input lines matching 'pattern'. The latter must be a
      valid AWK regular expression, without surrounding slashes.

    --key-columns (-K) col1[,col2,col3,...]
      Name, or comma-separated list of names, that must be regarded as
      the table key column(s) when encountered in the 'name' field of
      the input stream. See also the optional 'table' argument below.

    'table'
      If this argument is specified it is expected to be the filename
      of the table the input data belongs to; alternatively, it can be
      the filename of an index file of that same table. See the notes
      below for more explanations.

Notes:

The basic idea behind the 'muxtotable' operator is that a valid NoSQL
table can be built out of an (mostly) unordered list of name/value pairs,
one pair per line, with the 'value' portion possibly spanning over
multiple lines as signaled by a special continuation character or string
of characters (see the '-c' option above). Unlike the more structured
and _strictly_ ordered NoSQL 'list' format (see the 'listtotable and
'tabletolist' operators), name/value entries in a "multiplexed" (or
"muxed", for short) stream of data can occur in an unordered sequence,
interleaved with one another as desired, given only a minimal set of
constraints.

The optional 'table' command-line argument deserves some explanations.
According to NoSQL's specs, a table filename may contain the table key
column(s), so that if an operator (such as 'muxtotable') needs to know
that piece of information, and if the latter is not explicitly specified
on the command-line, it can be inferred from a properly-construed table
filename.

For instance, let's say we have a table called 'mytable' which primary
key field is 'Col1', then the relevant file can be called 'mytable._k.Col1'.
Likewise, if the table primary key is a concatenation of multiple columns,
say 'Col1' and 'Col2', the file name will be 'mytable._k.Col1.Col2'. Note
that this naming convention is not mandatory and a file hosting a NoSQL
table can be called anything, but if these rules are applied then that
filename will be "self-explanatory" for a number of NoSQL operators,
including 'muxtotable', which will be able to infer key- and index-related
information from such a file naming scheme.

A NoSQL table can have secondary index files associated with it, and the
names of those files, too, may tell something about the table columns
they refer to. So, let's say the file hosting 'mytable' is called
'mytable._k.Col1.Col2', a secondary index file on columns 'Col3' and
'Col4' of 'mytable' can have name 'mytable._k.Col1.Col2._x.Col3.Col4'.
In this way an operator may acquire much of the key- and index-related
knowledge by simply reading the involved filename.

The '_k' and '_x' indicators were choosen because no real column name
in a NoSQL table should begin with an uderscore, so there's no risk of
ambiguities. See also 'updtable --help' for another example of use of
said naming convention.

The file naming scheme just described dates back to when a true Database
Schema had not yet been introduced in NoSQL (see the 'constraint' operator).
A better way of handling key- and index- related knowledge would therefore
be to read it from the database schema file, like other operators already do.
That will eventually be introduced in future releases of NoSQL, but short
of that the old conventions apply, and they probably always will, to retain
backward compatibility.

Example 1:

Let's assume we have a NoSQL table named 'order', like this:

   Order   Customer   Amount   Date
   ------  --------   ------   ----------
   so-001  cust01      12.99   05/09/2010
   so-002  cust02      93.28   11/11/2010
   so-003  cust02      67.11   19/11/2010

Normally, an application program that needs to update the 'Amount' value
associated with the record having primary key 'so-002' would need to
create a valid table with the updated pieces of information and then
feed it to the 'updtable' operator (see 'updtable --help' for more on
this). Alternatively, the application program could simply output a
muxed structure containing the desired updates, like this:

   Order so-002
   Amount 58.76

Assuming that the above updates are stored by the application program
in a file named 'updates', then for that data to be merged into the
'order' table the application would do:


   muxtotable --key-columns Order < updates |
			 updtable --stdin order | justify

and the updated 'order' table printed to STDOUT will be like this:

   Order   Customer   Amount   Date
   ------  --------   ------   ----------
   so-001  cust01      12.99   05/09/2010
   so-002  cust02      58.76   11/11/2010
   so-003  cust02      67.11   19/11/2010


Note the 'Amount' value changed from 93.28 to 58.76 for order 'so-002'.

As it is usual with NoSQL, no actual changes occur to the original 'order'
table file on disk; it is up to the application program to decide whether
those changes are to be made permanent, by capturing the result into a
workfile and then use the latter to overwrite the original table, handling
any write concurrency and data integrity issues along the way if needed,
either directly of with he help of the relevant NoSQL operators and
utilities, like 'constraint' and the like.

NoSQL's muxed format is quite flexible; for instance, if multiple values
are listed for the 'Amount' field under the same 'Order' value, the last
occurrence will take over. That is, if the 'updates' file contains the
following:

   Order so-002
   Amount 33.28
   Amount 13.28

then the updated 'Amount' field printed on STDOUT will be 13.28 .

Generally speaking, NoSQL's muxed format can come handy whenever either
we do not want to have to bother about table structures in our application
programs, or if we have table values that "come-in at random" at that need
to be stacked into a workfile that is later to be turned into a valid table.
Back to the above example, say our 'updates' file contains the following
name/value pairs:

   Order so-002
   Amount 33.28
   Amount 13.28
   Date 05/09/2010
   Amount 1.23
   Order so-002
   Date 01/01/2011
   Order so-001
   Order so-003
   Customer cust04

If we process such 'updates' file through the following pipeline:

   muxtotable --key-columns Order < updates | justify

here's what we get:

   Order   Amount  Date        Customer
   ------  ------  ----------  --------
   so-002    1.23  05/09/2010          
   so-002          01/01/2011          
   so-001                              
   so-003                      cust04

If neither the '--key-columns' option is specified, nor the key column
name(s) can be inferred form the specified file-name (as it is the
case for the 'updates' file in the above example), then the very first
field name listed in the muxed structure will be taken as the table
primary key ('Order', in this case). Conversely, if the key name(s) is
specified, then if a non-key field is encountered in the input stream
before at least one value for each specified key is received, then
'muxtotable' will exit with an error message. The check for every
key to be not-null is repeated every time a key field is encountered
after a non-key one, to ensure that every new "logical block" of
input fully declares the key value(s) it refers to.

For the above updates to actually be merged into the 'order' table
the complete pipeline needed is:

   muxtotable --key-columns Order < updates |
	updtable --stdin --key-columns Order --fallback order | justify

and here's the final result printed to STDOUT:

   Order   Customer  Amount  Date
   ------  --------  ------  ----------
   so-001  cust01     12.99  05/09/2010
   so-002  cust02      1.23  01/01/2011
   so-003  cust04     67.11  19/11/2010

Consistently with what was previuosly explained about "self-describing"
file names, the 'updates' file could have been called 'updates._k.Order',
in which case no '--key-columns' option would have been necessary to
explicitly tell 'muxtotable' what we want it to consider as the table
key field(s).

See the source code of the 'muxtotable' operator for more information
on how the program behaves depending on the content of a muxed stream.
Back